arXiv:1504.05809v1 [cs.CV] 22 Apr 2015


LOAD: Local Orientation Adaptive Descriptor for Texture and Material Classification


Xianbiao Qi^1, Guoying Zhao^1, Linlin Shen^2, Qingquan Li^2, Matti Pietikäinen^1

^1 Center for Machine Vision Research, University of Oulu, PO Box 4500, FIN-90014, Finland. E-mails: qixianbiao@gmail.com, gyzhao@ee.oulu.fi, mkp@ee.oulu.fi

^2 Shenzhen University, Shenzhen 518000, China. E-mails: llshen@szu.edu.cn, liqq@szu.edu.cn


Abstract 

In this paper, we propose a novel local feature, called Local Orientation Adaptive Descriptor (LOAD), to capture
regional texture in an image. In LOAD, we propose to define the point description on an Adaptive Coordinate System
(ACS), adopt a binary sequence descriptor to capture relationships between a point and its neighbors, and use a
multi-scale strategy to enhance the discriminative power of the descriptor. The proposed LOAD not only has strong
discriminative power for capturing texture information, but is also robust to illumination variation and image
rotation. Extensive experiments on benchmark data sets for texture classification and real-world material recognition
show that the proposed LOAD yields state-of-the-art performance. It is worth mentioning that we achieve a 65.4%
classification accuracy on the Flickr Material Database using a single feature, which is, to the best of our
knowledge, the highest reported so far. Moreover, by combining LOAD with features extracted by Convolutional Neural
Networks (CNN), we obtain significantly better performance than either LOAD or CNN alone. This result confirms that
LOAD is complementary to learning-based features.

Keywords: Local Orientation Adaptive Descriptor, Texture Classification, Material Recognition, Improved Fisher 
Vector, Convolutional Neural Network 


1. Introduction 

Visual image classification [31] [32] [?] [29] [9] [?] is a
challenging problem in computer vision, especially under
multiple sources of image transformation, e.g., rotation,
illumination, affine, and scale variations. The
Bag-of-Words (BoW) [?] model, a powerful intermediate
image representation, has been the most popular
approach to visual categorization over the past ten years.
In the BoW model, low-level feature extraction and
mid-level feature encoding are the two most important problems.
In the past few years, several advanced mid-level feature
encoding approaches have been proposed, such as
Locality-constrained Linear Coding (LLC) [?], Vector
of Locally Aggregated Descriptors (VLAD) [13], and
Improved Fisher Vector (IFV) [25]. These encoding methods
have greatly advanced the BoW approach. In contrast,
progress in low-level feature extraction has been slow.

Earlier work on texture description mainly focused on
capturing either global texture information (e.g., GIST [?], Gabor)
or fine texture micro-structure (e.g., the MR8 filter bank
[32], Local Binary Pattern (LBP) [23]). Global texture
descriptors capture global texture information well
but miss most texture details. For instance, GIST
is good at capturing the spatial layout of a scene but
performs poorly on simple texture classification tasks in which
micro-structures are important. Fine texture descriptors,
defined on very small patches (e.g., 3 x 3 or
5 x 5), capture small texture structures well but
ignore global texture information. For example, LBP
and MR8 perform well on some simple texture data sets
but poorly on complex material data sets in which
regional texture information is important. Some
works have tried to bridge the gap between these two types
of features. However, as we will discuss later, these
features may suffer from limitations such as sensitivity
to image transformations or limited discriminative
power.

This paper aims to provide a powerful regional texture
descriptor. To this end, we propose a novel Local
Orientation Adaptive Descriptor (LOAD). The proposed
descriptor has two important advantages. (i) Strong regional
texture discrimination, which comes from two aspects.
Firstly, for a single point, we adopt
a binary sequence description that has stronger
discriminative power than gradient orientation (e.g., SIFT
[21], MORGH [7]) and local intensity order (e.g., LIOP
[36]). Secondly, to enhance the discriminative power of
the descriptor, we use a multi-scale description
to capture multi-scale texture information. (ii) Robustness
to image rotation and illumination variation. Because
LOAD is defined on an Adaptive Coordinate System,
it is robust to image rotation. Meanwhile, the
binary sequence description used in LOAD gives the
feature strong robustness to illumination variation.

Our first contribution in this paper is to propose a
novel and discriminative texture descriptor, LOAD, and
to demonstrate its effectiveness in two applications:
texture classification and real-world material classification. On the
traditional texture data sets [22] [?], LOAD almost
saturates the classification performance. On the real-world
Flickr Material Database (FMD) [?], LOAD
achieves 65.4%, which is the best result for a single feature as
far as we know.

Our second contribution is that we build a new real-world
material data set from the recently introduced ETHZ
Synthesizability data set. We name the new
data set OULU-ETHZ. We evaluate and compare
LOAD with LBP, PRICoLBP, and CNN on
OULU-ETHZ. Experiments show that LOAD
achieves promising performance on the new data set.

Our third contribution is that we experimentally
demonstrate that the proposed LOAD is strongly
complementary to learning-based features, such
as Convolutional Neural Networks (CNN) [?] [?]. On
the Flickr Material Database [?], LOAD combined
with the CNN achieves 72.5%, which significantly
outperforms the CNN (61.2%) and LOAD (65.4%). On the
OULU-ETHZ data set, the combination of LOAD and
CNN improves on the CNN by around 6.0%.

We believe the strong complementarity arises because
the IFV representation with LOAD and the CNN
belong to two different approaches: non-structured and
structured methods. The former is robust to image rotation
and translation, but does not capture structured
information well. In contrast, the latter is good at capturing
structured information because its hierarchical max-pooling
strategy preserves it,
but is not robust to heavy image rotation and translation.

2. Related Works 

Since the proposed descriptor is partially inspired by
Local Binary Pattern (LBP) [23], we give a brief
introduction to LBP.

2.1. Local Binary Pattern

LBP is an effective gray-scale texture operator. Each
LBP pattern corresponds to a kind of local structure in
natural images, such as a flat region, an edge, a contour, and so on.

For a pixel $(x_c, y_c)$ in an image $I$, its LBP code can be
computed by thresholding the pixel values of its neighbors
with the pixel value of the central point $(x_c, y_c)$:

$$\mathrm{LBP}_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} \mathrm{sign}(g_p - g_c)\, 2^p, \qquad \mathrm{sign}(t) = \begin{cases} 1, & t \ge 0 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $P$ is the number of neighbors and $R$ is the radius.
$g_c = I(x_c, y_c)$ is the gray value of the central pixel
$(x_c, y_c)$, and $g_p = I(x_p, y_p)$ is the value of its p-th neighbor
$(x_p, y_p)$.

Ojala et al. also pointed out that the patterns with
at most two bitwise transitions describe the fundamental
properties of the image, and they called these patterns
“uniform patterns”. The number of spatial transitions can
be calculated as follows:

$$\Phi(\mathrm{LBP}_{P,R}(x_c, y_c)) = \sum_{p=1}^{P} \left| \mathrm{sign}(g_p - g_c) - \mathrm{sign}(g_{p-1} - g_c) \right|, \tag{2}$$

where $g_P$ equals $g_0$. The uniform patterns are defined
by $\Phi(\mathrm{LBP}_{P,R}) \le 2$. For instance, “11000011” and
“00001110” are two uniform patterns, while “00100100”
and “01001110” are non-uniform patterns.

The LBP with $P = 8$ has $2^8 = 256$ patterns, of which
58 are uniform and 198 are non-uniform.
According to the statistics in [23], although
there are far fewer uniform patterns than
non-uniform patterns, uniform patterns
account for 80%-90% of all pattern occurrences. Thus, instead of
the original 256-bin LBP, the uniform LBP is widely used in
many applications such as face recognition.
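The counts above can be checked directly. The following is a minimal sketch, not the authors' implementation: the helper names `lbp_code`, `transitions`, and `is_uniform` are hypothetical, and the neighbor values are assumed to be already sampled on the circle of radius R.

```python
def lbp_code(neighbors, center):
    """Eq. (1): threshold the P neighbors against the center and pack the bits."""
    return sum((1 if g >= center else 0) << p for p, g in enumerate(neighbors))

def transitions(code, P=8):
    """Eq. (2): number of circular 0/1 transitions in a P-bit pattern."""
    bits = [(code >> p) & 1 for p in range(P)]
    # bits[p - 1] wraps around for p = 0, which plays the role of g_P = g_0.
    return sum(bits[p] != bits[p - 1] for p in range(P))

def is_uniform(code, P=8):
    """A pattern is uniform when it has at most two circular transitions."""
    return transitions(code, P) <= 2

uniform_codes = [c for c in range(256) if is_uniform(c)]
non_uniform_codes = [c for c in range(256) if not is_uniform(c)]
```

Enumerating all 256 codes this way reproduces the 58 / 198 split, and the four example patterns in the text fall on the expected sides.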


3. Local Orientation Adaptive Descriptor 

Our goal is to design a discriminative texture descriptor
with the following two properties:

• Regional texture discrimination: Most descriptors,
such as SIFT and HOG2x2, are designed for
image matching or human detection rather than
for texture description, so their texture discrimination
may be limited. Although there are effective
texture descriptors in the literature, such as GIST,
LBP, and Completed LBP (CLBP), most of them are
constructed for a global or fine texture description
and thus ignore regional texture information. In this
work, we focus on designing a discriminative
regional texture descriptor.

• Robustness to image transformations: Natural images
contain rich image transformations, of which rotation
and illumination variations are the two most common.
Thus, when designing a feature, these
two aspects should be carefully considered.

Figure 2: Illustration of the Local Orientation Adaptive
Descriptor. The point O is the central point of the patch.
The pattern for point A is “00001111”, and the pattern for
point B is “00000110”.


In what follows, we describe the LOAD descriptor
in detail. In Section 3.1, we describe the description
strategy for each point under an adaptive coordinate system.
In Section 3.2, we introduce a multi-scale description
strategy that enhances the discriminative
power of the descriptor. We then describe the
histogram construction and normalization approaches in
Section 3.3. Finally, in Section 3.4, we discuss the
relationship between LOAD and some existing features.


3.1. Point Description 

Given similar patches under different image rotations,
as shown in Figure 1, our objective is to extract a
descriptor that is discriminative and transformation invariant.
To achieve rotation invariance, traditional methods
(e.g., SIFT) first estimate a reference orientation
(also called the main orientation) and then align the patch to
it. However, estimating the reference
orientation significantly increases the computational
cost of the descriptor. Moreover, as indicated
in [?], the descriptor is sensitive to errors in
the orientation estimation.

As a circular patch is symmetric with respect to any
line through its central point, we choose to sample a circular
region around each point. Given a sampled point O,
we obtain a circular patch around O. By
rotating the patch around the central point O, we can obtain
a patch at an arbitrary angle, as shown in Figure 1. For
any point A in the patch, an Adaptive Coordinate System
(ACS) can be formed by the point A and the reference
point O, as shown in Figure 2. Under the ACS, the
neighboring relationship between point A and its neighbors is
invariant to image rotation. This means that, as shown in
Figure 2, the positions of point A’s neighbors are always
fixed relative to point A. Thus, the pixel values of
A’s neighbors are also invariant to image rotation.

Under the ACS, any point in the patch can be encoded
in a rotation-invariant way. In this paper, we propose a
novel Local Orientation Adaptive Descriptor (LOAD)
built on the ACS. As illustrated in Figure 2, the LOAD
pattern for the point A is encoded as follows:

$$\mathrm{LOAD}_{P,R}(x_A, y_A, \theta_A) = \sum_{p=0}^{P-1} \mathrm{sign}(V(A_p) - V(A))\, 2^p, \tag{3}$$

$$x_{A_p} = x_A + R\cos(2\pi p/P - \theta_A),$$

$$y_{A_p} = y_A - R\sin(2\pi p/P - \theta_A),$$

where $P$ is the number of neighbors, $R$ is the radius,
$(x_A, y_A)$ and $(x_{A_p}, y_{A_p})$ are the positions of the central
point A and its p-th neighbor under the ACS, $V(A)$
and $V(A_p)$ denote the pixel values at $(x_A, y_A)$
and $(x_{A_p}, y_{A_p})$ respectively, sign(·) is the sign function, and
$\theta_A$ is the orientation angle of the ACS at A, given by an
arctangent of the position of A relative to the reference point O.

In the same way, under the ACS, the adaptive gradient
magnitude for the point A is defined as follows:

$$M(A) = \sqrt{(V(A_4) - V(A_0))^2 + (V(A_6) - V(A_2))^2} \tag{4}$$

where $M(A)$ is computed with $R = 1$.
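To make the point encoding concrete, the following is a minimal sketch of Eqs. (3) and (4). It is not the authors' code. It assumes that $\theta_A$ is the angle of the vector from O to A, computed as `arctan2(y_A - y_O, x_A - x_O)` in image coordinates, that off-grid neighbors are read by bilinear interpolation, and that A differs from O (the ACS is undefined at O itself). The function names are hypothetical.

```python
import numpy as np

def bilinear(img, x, y):
    """Bilinearly interpolate img (indexed [row, col] = [y, x]) at a real-valued point."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * img[y0, x0] + fx * (1 - fy) * img[y0, x0 + 1]
            + (1 - fx) * fy * img[y0 + 1, x0] + fx * fy * img[y0 + 1, x0 + 1])

def acs_neighbors(img, xa, ya, xo, yo, P=8, R=1.0):
    """Pixel values V(A_p) of the P neighbors of A under the ACS defined by O and A."""
    theta = np.arctan2(ya - yo, xa - xo)      # assumed definition of theta_A
    vals = []
    for p in range(P):
        ang = 2 * np.pi * p / P - theta
        vals.append(bilinear(img, xa + R * np.cos(ang), ya - R * np.sin(ang)))
    return np.array(vals)

def load_pattern(img, xa, ya, xo, yo, P=8, R=1.0):
    """Eq. (3): binary pattern of the integer-valued point A = (xa, ya)."""
    v = acs_neighbors(img, xa, ya, xo, yo, P, R)
    return sum(1 << p for p in range(P) if v[p] >= img[ya, xa])

def load_magnitude(img, xa, ya, xo, yo):
    """Eq. (4): adaptive gradient magnitude, computed with P = 8 and R = 1."""
    v = acs_neighbors(img, xa, ya, xo, yo, 8, 1.0)
    return float(np.hypot(v[4] - v[0], v[6] - v[2]))
```

Under these assumptions, rotating the image by 90 degrees about O (for example with `np.rot90` on an odd-sized square image) and evaluating the pattern at the correspondingly rotated point yields the same code, which is a quick way to check the rotation-invariance claim.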

The encoding in Eq. (3) has two desirable
properties. (i) Rotation invariance: under the ACS,
the neighboring relationships between a point and its
neighbors are fixed. As shown in Figure 2, the same start
point $A_0$ is always selected for the point A. Thus,
the LOAD encoding is rotation invariant. (ii) Robustness
to illumination variation: because it uses a binary sequence
description, LOAD is also robust to illumination
variation, since illumination variation usually does
not change the binary comparison relationship between
two adjacent pixels.

Figure 3: Multi-scale Local Orientation Adaptive Descriptor.
The pattern for the inner scale is “10001111”,
and the pattern for the outer scale is “10000011”.

According to Eq. (3), when P is set to 8, LOAD
has 256 patterns, which may be too many for a local descriptor.
Motivated by the “uniform” encoding in LBP [23], we
also adopt the “uniform” strategy for LOAD. Thus, the
dimension of LOAD is 59.

3.2. Multi-scale Description 

Multiresolution analysis (also called multi-scale
analysis) is an effective way to depict texture information
at different scales. The multi-scale strategy is widely used
in LBP [23] and its variants [10] [11] [38]. As pointed
out by previous works, a multi-scale description
performs significantly better than a single-scale one.
The multi-scale version of LOAD is defined as
follows:

$$\mathrm{LOAD}_{P,R}(x_A, y_A, \theta_A, s) = \sum_{p=0}^{P-1} \mathrm{sign}(V(A_p) - V(A))\, 2^p, \tag{5}$$

$$x_{A_p} = x_A + s \times R\cos(2\pi p/P - \theta_A),$$

$$y_{A_p} = y_A - s \times R\sin(2\pi p/P - \theta_A),$$

where $s$ is a scale factor.

Compared with Eq. (3), Eq. (5) introduces a scale factor.
By choosing different scale factors, we
obtain LOAD patterns at different scales. Figure 3 shows
LOAD with two scales. In practice, we can choose 2, 3,
or 4 scales. As shown in Figure 3, the binary sequence for
the inner scale is “10001111”, and the binary sequence
for the outer scale is “10000011”. If the patterns at the
inner and outer scales are similar, it may indicate that the
structure around this point is consistent, and vice versa.


Algorithm 1 Calculation of LOAD feature

Input: One reference point O and a circular patch P around O;
Output: LOAD histogram feature H
1: Initialize a 2-D histogram H with zeros; the size of H is 59 x S;
2: for all O_i ∈ P do
3:     Compute the gradient magnitude M(O_i) of the point O_i as in Eq. (4);
4:     for each s ∈ [1, S] do
5:         Calculate the uniform LOAD pattern with A_0, as shown in Figure 3, as the start point; denote it as U_s(O_i);
6:         Accumulate the histogram H:
7:             H(U_s(O_i), s) = H(U_s(O_i), s) + M(O_i);
8:     end for
9: end for
10: Reshape the histogram H into a 1-D vector and normalize it with the square-root norm;
11: Return H.


3.3. Histogram Construction and Normalization 

Given a circular patch with the point O as its central
point, suppose that the patch has K points. Assume that
we use S scales; the dimension for each scale is 59, so
the final feature dimension is 59 x S. We initialize a 2-D
histogram H with all zeros. Then, for each point $O_i$, $i \in [1, K]$,
we accumulate the histogram H as follows:

$$H(U_s(O_i), s) = H(U_s(O_i), s) + M(O_i) \tag{6}$$

where $s \in [1, S]$, $M(O_i)$ is the gradient magnitude of
point $O_i$ under the ACS, computed according to Eq. (4), and
$U_s(O_i)$ is the “uniform” pattern of the LOAD feature
of the point $O_i$ at scale $s$.

After accumulating all K points in the patch into the
histogram H, we reshape the histogram into a 1-D vector.

Feature normalization is an important step for both feature
description (e.g., RootSIFT [?]) and image
representation [?]. In this paper, we follow the operation
in RootSIFT and apply the square-root operation to
LOAD. Previous works [?] [?] have shown that
square-root normalization performs better than L2 normalization.

For clarity, we summarize the algorithm for calculating
the LOAD feature in Algorithm 1, in which S is the number
of scales and $U_s(O_i)$ is the uniform pattern representation
of the LOAD feature of the point $O_i$ at scale $s$.
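The accumulation in Eq. (6) and the final normalization can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' Matlab mex code: $\theta_A$ is taken as `arctan2(y_A - y_O, x_A - x_O)`, neighbors are read by bilinear interpolation, the center O itself is skipped because the ACS is undefined there, all non-uniform codes share one bin (58 + 1 = 59), and the square-root step is applied after L1 normalization as in RootSIFT. The names `uniform_table`, `ring`, and `load_descriptor` and the default `patch_radius` are hypothetical.

```python
import numpy as np

def uniform_table(P=8):
    """Map every P-bit code to a histogram bin; non-uniform codes share the last bin."""
    table = np.full(2 ** P, -1, dtype=int)
    n_uniform = 0
    for code in range(2 ** P):
        bits = [(code >> p) & 1 for p in range(P)]
        if sum(bits[p] != bits[p - 1] for p in range(P)) <= 2:   # circular transitions
            table[code] = n_uniform
            n_uniform += 1
    table[table < 0] = n_uniform
    return table, n_uniform + 1            # 59 bins when P = 8

def bilinear(img, x, y):
    """Bilinearly interpolate img (indexed [row, col] = [y, x]) at a real-valued point."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * img[y0, x0] + fx * (1 - fy) * img[y0, x0 + 1]
            + (1 - fx) * fy * img[y0 + 1, x0] + fx * fy * img[y0 + 1, x0 + 1])

def ring(img, xa, ya, theta, radius, P=8):
    """Sample the P neighbors of A at the given radius under the ACS."""
    ang = 2 * np.pi * np.arange(P) / P - theta
    return np.array([bilinear(img, xa + radius * np.cos(a), ya - radius * np.sin(a))
                     for a in ang])

def load_descriptor(img, xo, yo, patch_radius=6, S=4, P=8, R=1.0):
    """Algorithm 1: magnitude-weighted histogram of uniform LOAD patterns over S scales."""
    table, n_bins = uniform_table(P)
    H = np.zeros((n_bins, S))
    weights = 1 << np.arange(P)
    for ya in range(yo - patch_radius, yo + patch_radius + 1):
        for xa in range(xo - patch_radius, xo + patch_radius + 1):
            d2 = (xa - xo) ** 2 + (ya - yo) ** 2
            if d2 == 0 or d2 > patch_radius ** 2:
                continue                    # skip O itself and points outside the circle
            theta = np.arctan2(ya - yo, xa - xo)          # assumed form of theta_A
            v = ring(img, xa, ya, theta, 1.0, 8)
            mag = np.hypot(v[4] - v[0], v[6] - v[2])      # Eq. (4), computed with R = 1
            for s in range(1, S + 1):
                v = ring(img, xa, ya, theta, s * R, P)    # Eq. (5)
                code = int(np.dot((v >= img[ya, xa]).astype(int), weights))
                H[table[code], s - 1] += mag              # Eq. (6)
    h = H.ravel()
    h = h / max(h.sum(), 1e-12)             # L1 normalization (assumed, as in RootSIFT)
    return np.sqrt(h)                       # square-root step
```

With S = 4 the returned vector has 59 x 4 = 236 entries. The caller is responsible for keeping the patch far enough from the image border that all sampled neighbors lie inside the image.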

3.4. Relationship to Other Features 

Our LOAD feature is related to some existing features
in the literature. The first category of related features
is the LBP-based methods, e.g., LBP [23] and CLBP [10].
Another set of related features is the local intensity order
based methods, including MORGH [7] and LIOP [36].
Different from the LBP-based methods, LOAD
has the following two properties:

• Regional texture discrimination: LOAD is a
patch-based feature, whereas the LBP-based methods,
e.g., LBP and CLBP, were designed to depict
micro-structures. An image representation based on LBP
computes the histogram of patterns, whereas the image
representation with LOAD uses the BoW model.

• Trade-off between rotation invariance and discriminative
power: The LOAD descriptor for each point is
built on the ACS. Thus, LOAD not only achieves
good robustness to image rotation but also has strong
discriminative power. In contrast, the LBP-based
methods achieve rotation invariance at the cost
of discriminative power.

Different from LIOP and MORGH, LOAD has the
following two properties:

• Richer patterns: LOAD adopts a binary pattern
description. Using the binary pattern descriptor,
LOAD has richer patterns than LIOP (16 patterns)
and MORGH (8 patterns) on a single point.

• No dependence on region division:
LOAD does not employ region division.
Intensity order based region division may be
sensitive to non-monotonic illumination variation.
Moreover, region division greatly increases
the feature dimension.


4. Encoding 


The Improved Fisher Vector (IFV) [25] encoding was
proposed to address the information loss that occurs
during feature encoding in the traditional
BoW model. In the IFV framework, images are
represented by encoding densely sampled local descriptors.
Principal Component Analysis (PCA) is first used
to decorrelate the descriptor dimensions;
we keep D components. Then, a Gaussian
mixture model (GMM) is estimated on the PCA-reduced
local descriptors to build the visual words. The
IFV measures the normalized deviations of the local
descriptors w.r.t. the GMM parameters. More specifically, let
$I = \{x_t, t = 1, \dots, T\}$ be the set of D-dimensional
PCA-reduced local descriptors extracted from an image.
Denote the parameters of a K-component GMM by
$\Theta = \{\pi_k, \mu_k, \Sigma_k, k = 1, \dots, K\}$, where $\pi_k$, $\mu_k$, and $\Sigma_k$
are the prior, mean vector, and covariance matrix of the
k-th component, respectively. Given the soft assignment
$\gamma_{tk}$ of $x_t$ to each of the K components, the IFV encoding
of $I$ is defined as follows:

$$\Phi(I) = \frac{1}{T} \sum_{t=1}^{T} \phi(x_t), \tag{7}$$

with

$$\phi(x_t) = [\phi_1(x_t), \dots, \phi_K(x_t)], \tag{8}$$

where

$$\phi_k(x_t) = \left[ \frac{\gamma_{tk}}{\sqrt{\pi_k}} \, \frac{x_t - \mu_k}{\sigma_k}, \;\; \frac{\gamma_{tk}}{\sqrt{2\pi_k}} \left( \frac{(x_t - \mu_k)^2}{\sigma_k^2} - 1 \right) \right] \in \mathbb{R}^{2D}, \qquad k = 1, \dots, K. \tag{9}$$

Here $\sigma_k$ is the vector of standard deviations on the diagonal of $\Sigma_k$,
and the divisions and squares are element-wise.

The IFV encoding is a vector representation of 2D x K
dimensions. In the IFV, power (signed square root)
normalization usually shows better performance than
L2 normalization.

Compared with BoW with K-means, the IFV framework
provides a more general way to represent an image
through a generative process of local descriptors, and it can
be computed efficiently from much smaller vocabularies.
Chatfield et al. [?] evaluated state-of-the-art
encoding methods such as the IFV, the Super Vector, and
Locality-constrained Linear Coding (LLC), and showed that the
IFV performs best among all compared encodings.
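As an illustration of Eqs. (7)-(9), the sketch below encodes a set of descriptors against a diagonal-covariance GMM with NumPy. It is a minimal sketch rather than the VLFeat implementation used in the experiments. The function name `ifv_encode` and its argument names are hypothetical, the soft assignments are taken to be the usual GMM posteriors, and the averaging over T follows the common IFV formulation.

```python
import numpy as np

def ifv_encode(X, prior, mu, sigma2):
    """Fisher vector of descriptors X (T, D) for a diagonal GMM.

    prior: (K,) mixture weights; mu, sigma2: (K, D) means and diagonal variances.
    Returns a vector of length 2 * D * K after signed square-root normalization.
    """
    T, D = X.shape
    diff = X[:, None, :] - mu[None, :, :]                        # (T, K, D)
    # Log of prior times Gaussian density, per descriptor and component.
    log_p = (-0.5 * np.sum(diff ** 2 / sigma2 + np.log(2 * np.pi * sigma2), axis=2)
             + np.log(prior))                                    # (T, K)
    log_p -= log_p.max(axis=1, keepdims=True)                    # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                    # soft assignments
    z = diff / np.sqrt(sigma2)                                   # (x_t - mu_k) / sigma_k
    first = (gamma[:, :, None] * z).sum(axis=0) / (T * np.sqrt(prior)[:, None])
    second = ((gamma[:, :, None] * (z ** 2 - 1)).sum(axis=0)
              / (T * np.sqrt(2 * prior)[:, None]))
    fv = np.concatenate([first, second], axis=1).ravel()         # [phi_1, ..., phi_K]
    return np.sign(fv) * np.sqrt(np.abs(fv))                     # power normalization
```

For D = 100 PCA components and K = 256 Gaussians, the output has 2 x 100 x 256 = 51,200 entries.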


5. Experiments 

5.1. Implementation Details

LOAD. In LOAD, we use four scales ((8, 1), (8, 2),
(8, 3), and (8, 4)). The dimension for each scale is 59;
thus, the final dimension is 236. Experiments show that
four scales usually perform slightly better than
two scales (e.g., (8, 1) and (8, 3)).

IFV. We first sample 100,000 LOAD features from
the training samples. These 100,000 LOAD features
are used to learn the PCA components, and 100 principal
components are preserved as the basis for dimension
reduction. As pointed out in [27], PCA, which
decorrelates the feature dimensions,
is a key step in the IFV framework. With the
100,000 PCA-reduced LOAD features, we
learn a GMM with 256 components. For the PCA, we use
the Matlab built-in SVD (Singular Value Decomposition).
For the GMM, we use VLFeat to learn the parameters
$\Theta = \{\pi_k, \mu_k, \Sigma_k, k = 1, \dots, K\}$. In the IFV, $\Sigma_k$ is
forced to be diagonal. The final dimension of the IFV
representation for each image is 2 x 100 x 256 = 51,200.

Classifier. We train a 1-vs-all linear SVM classifier
(with C = 10) using the Liblinear [?] toolbox.

It should be pointed out that the computational cost of
our LOAD descriptor is low. On a desktop computer with
a dual-core 3.4 GHz CPU, the (Matlab mex) implementation
takes about 2 s to extract 8000 features.

5.2. Evaluation of Properties 

Rotation Invariance. To evaluate the rotation
invariance of the LOAD feature, we use three data sets:
Outex_TC_00010 (TC10), Outex_TC_00012 (TC12), and
UIUC. The experimental setup for each data set is
presented in the following application sections. We compare
LOAD with RootSIFT. We ensure that LOAD
and RootSIFT use the same number of features and the
same IFV representation framework. The experimental
results for both features are shown in Table 1.

Table 1: Evaluation of the rotation invariance of LOAD
on the TC10, TC12, and UIUC data sets (accuracy, %). TC12 has one value per test illumination.

Features    UIUC    TC10     TC12
RootSIFT    97.1    48.78    53.98 / 54.56
LOAD        99.6    99.95    99.65 / 99.33


According to Table 1, we have two observations. (1)
On the data sets with strong rotation, such as TC10 and
TC12, LOAD shows great robustness to image rotation
and significantly outperforms RootSIFT. (2) On
the UIUC data set, which has small image rotations,
LOAD still performs better than RootSIFT.

Discriminative Power. To assess the discriminative
power of LOAD, we directly compare it with
RootSIFT under two sampling strategies:
single-scale and multi-scale sampling. For single-scale sampling,
we directly sample points on the original images. For
multi-scale sampling, we densely extract features from
six scales with rescaling factors $2^{-i/2}$, $i = -1, 0, 1, \dots, 4$.

We evaluate LOAD and RootSIFT on the Flickr Material
Database (FMD) and UIUC data sets. The results are
shown in Table 2.

Table 2: Comparison of LOAD and RootSIFT on the
FMD and UIUC data sets (accuracy, %).

Sampling Strategy    Features    FMD     UIUC
Single-scale         RootSIFT    56.5    96.1
Single-scale         LOAD        62.1    99.3
Multi-scale          RootSIFT    60.5    97.1
Multi-scale          LOAD        65.4    99.6


From Table 2, under both single-scale and multi-scale
sampling, LOAD outperforms RootSIFT.
For instance, with single-scale sampling,
LOAD improves on RootSIFT by 5.6 percentage points on the FMD data set.
We can also see that the multi-scale sampling
strategy consistently outperforms the single-scale
sampling strategy.

Figure 4: Sample images from the TC10, TC12, and UIUC
texture data sets. Note that TC10 and TC12 have strong
rotation variation, and UIUC has strong rotation, scale,
and affine transformations.

5.3. Texture Classification 

The Outex [22] database has two test suites:
Outex_TC_00010 (TC10) and Outex_TC_00012
(TC12). The two test suites contain the same 24 classes
of textures, which were collected under three different
illuminations (horizon, inca, and tl84) and nine different
rotation angles (0°, 5°, 10°, 15°, 30°, 45°, 60°, 75°, and 90°).
There are 20 non-overlapping 128 x 128 texture samples
for each class. For TC10, the samples of illumination “inca”
with angle 0° in each class were used for training, and the
other eight rotation angles with the same illumination
were used for testing. Hence, there are 480 (24 x 20)
training samples and 3,840 (24 x 20 x 8) validation
samples. For TC12, the classifier was trained with
the same training samples as TC10, and it was tested
with all samples captured under illumination “tl84”
or “horizon”. Hence, there are 480 (24 x 20) training
samples and 4,320 (24 x 20 x 9) validation samples for
each illumination. It should be noted that the training
images come from only one angle, whereas the testing images
come from different angles.

The UIUC [?] texture data set contains 1,000 images: 25
different texture categories with 40 samples in each category.
The image size in the data set is 640 x 480. This
data set has strong rotation and scale variations. In the
experiments, 20 samples from each category are used for
training, and the remaining 20 samples are used for testing.

Sample images from the above three data sets are shown in
Figure 4. For all three data sets, we densely extract features
from six scales with rescaling factors $2^{-i/2}$, $i = -1, 0, 1, \dots, 4$.
We use the IFV representation and a linear
SVM. The results on the TC10, TC12, and UIUC data sets are
shown in Table 3.


Methods                  TC10     TC12
Dense SIFT (SVM)         48.78    53.98 / 54.56
CLBP SM/C (NN) [10]      99.14    95.18 / 95.55
BRINT (NN) [20]          99.35    97.69 / 98.56
BRINT (SVM) [20]         99.30    98.13 / 98.33
LOAD (SVM)               99.95    99.65 / 99.33

(a) Experimental results on data sets TC10 and TC12.

Methods                  Acc.     Methods       Acc.
Lazebnik et al. [15]     96.0     WMFS [37]     98.6
BIF [?]                  98.8     SRP [?]       98.56
Sifre et al. [30]        99.4     RootSIFT      97.0
Cimpoi et al. [3]        99.0     LOAD          99.6

(b) Experimental results on UIUC data set.

Table 3: Comparison with state-of-the-art methods on the
TC10, TC12, and UIUC texture data sets (accuracy, %).

Table 3(a) shows that the rotation-invariant methods,
including CLBP, BRINT, and LOAD, significantly
outperform the rotation-sensitive method (Dense SIFT with
IFV). Moreover, among all rotation-invariant methods,
LOAD works best. According to Table 3(b), on the UIUC
data set, LOAD also outperforms the state-of-the-art
methods, including SRP [?] and two recently published
works [30] [3].

5.4. Real-World Material Classification 

The Flickr Material Database (FMD) [?] is a challenging
real-world material data set. It contains 10 categories:
fabric, foliage, glass, leather, metal, paper, plastic,
stone, water, and wood. As pointed out in [29], FMD
was designed with the specific goal of capturing the
appearance variations of real-world materials by including
a diverse selection of samples in each category. Each
category in FMD has 100 images, of which 50 are
used for training and the remaining 50 for testing.
Sample images are shown in Figure 5. We use
multi-scale sampling and densely extract features from
six scales with rescaling factors $2^{-i/2}$, $i = -1, 0, 1, \dots, 4$.
The step size for sampling is 4. About
43,000 points are sampled from each image.

Figure 5: Sample images of 10 categories from the FMD
data set.

In the experiments, we compare our feature with many
state-of-the-art methods, including Kernel Descriptor [?],
Pairwise Rotation Invariant Co-occurrence LBP (PRICoLBP)
[26], DTD (a texture attribute descriptor) [?], and
CNN^1 [28].

This paper investigates two key issues: (1) how
much does the proposed feature depend on the dictionary
(learned by GMM) in IFV? (2) how much
complementary information can learning-based methods
(e.g., CNN) provide for the LOAD feature with the IFV
representation? For the first question, we compare
LOAD with the IFV representation using vocabularies
learned from FMD and from an external data set. We
randomly select 500 images from [?] as the external data
set. For the second question, we evaluate the combination
of LOAD with the CNN feature. All relevant results
are shown in Table 4, and three classification confusion
matrices for the CNN, LOAD, and the combination of
the CNN and LOAD are shown in Figure 6.

^1 We use the OverFeat [28] toolbox in this paper.

Figure 6: Classification confusion matrices for CNN, LOAD, and the combination of CNN and LOAD (LOAD + CNN, 72.5%) on the FMD data
set.

Table 4: Comparison of state-of-the-art methods on the FMD
data set (accuracy, %). LOAD* means using a vocabulary learned from
an external data set. Note that the recognition accuracy
for humans on the FMD is 84.9%, as reported in [29].

Methods                        Accuracy
Liu et al. CVPR'10 [?]         44.6
Hu et al. BMVC'11 [?]          49
Qi et al. ECCV'12 [26]         57.1 ± 1.8
Li et al. ECCV'12 [17]         48.1
Sharan et al. IJCV'13 [29]     57.1
DTD CVPR'14 [?]                49.8 ± 1.3
Features Combined [?]          67.1 ± 1.5
CNN [28]                       61.2 ± 1.9
LOAD*                          64.6 ± 1.7
LOAD                           65.4 ± 1.7
LOAD + CNN                     72.5 ± 1.4


From Table 4, we can observe that:

• LOAD achieves better performance than previous
works based on a single feature,
such as Kernel Descriptor, DTD, and PRICoLBP.
It also outperforms some methods based on multiple
features, such as Liu et al. [?] and Sharan et
al. [29], whose results are based on a combination of
seven features.

• LOAD combined with the CNN significantly
improves on both of them. The combination of the
CNN and LOAD decreases the error rate of LOAD
by about 20% and the error rate of the CNN
by about 30%.

• LOAD is not sensitive to the source of the
vocabulary. LOAD with a vocabulary learned from
FMD is only slightly better than LOAD* with a
vocabulary learned from an external data set.
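The relative error-rate reductions quoted above follow directly from the accuracies in Table 4. The short check below is only a worked calculation; the helper name `relative_error_reduction` is hypothetical.

```python
# Mean accuracies (%) on FMD taken from Table 4.
accuracy = {"CNN": 61.2, "LOAD": 65.4, "LOAD + CNN": 72.5}

def relative_error_reduction(base_acc, new_acc):
    """Percentage by which the error rate (100 - accuracy) shrinks."""
    base_err = 100.0 - base_acc
    new_err = 100.0 - new_acc
    return 100.0 * (base_err - new_err) / base_err

vs_load = relative_error_reduction(accuracy["LOAD"], accuracy["LOAD + CNN"])
vs_cnn = relative_error_reduction(accuracy["CNN"], accuracy["LOAD + CNN"])
```

The error rate of LOAD drops from 34.6% to 27.5%, a relative reduction of roughly 20%, and that of the CNN drops from 38.8% to 27.5%, roughly 29%, consistent with the rounded figures in the text.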

From Figure 6, we can see that the performances of
the CNN and LOAD on individual categories
differ considerably, for example on “foliage”, “metal”, and
“stone”. Meanwhile, on several categories,
such as “fabric” and “glass”, LOAD combined
with the CNN improves on the one with lower classification
accuracy by more than 10%.

Discussion. We believe the reason behind the significant
increase in classification performance is that the
CNN and IFV representations belong to two different
approaches: structured and non-structured. The CNN is
a structured method that is discriminative in capturing
spatial layout information. With the hierarchical max-pooling
strategy, structured information is well preserved
and captured. On the other hand, the
CNN may not be robust to heavy image rotation and
translation. In contrast, the IFV representation with the LOAD
feature is robust to image rotation and translation, but not
powerful in describing spatial structure information. We
believe this is why these two methods are
strongly complementary.

A New Material Dataset (OULU-ETHZ) is introduced in
this paper. The new data set is compiled from the
recently introduced ETHZ Synthesizability data set^,
which contains rich material images. The ETHZ data set
was designed to evaluate the synthesizability of images,
not for material recognition. In this paper, we select
13 material categories from this data set and construct
a new data set for material recognition.

The 13 categories are Cloud, Fabric, Flour, Fur,
Glass, Grass, Leather, Metal, Paper, Plastic, Sand,
Water and Wood. The number of images in each category
ranges from 44 to 420. As in the ETHZ data set, all
samples are 300 x 300 pixels. Some sample images are
shown in Figure 7.

The OULU-ETHZ and FMD data sets share some
properties and differ in others:

• The images in both FMD and OULU-ETHZ are
collected from real-world material images. Rich
appearance variation occurs in both data sets. For
instance, the “Air” category in Figure 7 shows huge
illumination variation.

• Compared to the FMD data set, most of the images
in the OULU-ETHZ are close-up images and are
therefore better aligned. This means that the images
in the OULU-ETHZ have a stronger scale and rotation
prior than those in the FMD.

To evaluate different algorithms, we use 20 samples
for training and the rest for testing. We pre-create
five training-testing configurations and report the
averaged accuracy. We compare the proposed LOAD with two
baseline methods (LBP, PRICoLBP) and also with CNN
approaches^. The results are shown in Table 5.
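
The protocol above can be sketched as follows. This is an illustrative sketch only: it assumes that the 20 training samples are drawn per category and that the five configurations are random splits, and it uses a nearest-centroid classifier as a stand-in so the example stays self-contained (the paper uses SVMs). The helper names, the seed and the toy data are hypothetical.

```python
import numpy as np

def split_per_class(labels, n_train, rng):
    """Return (train_idx, test_idx) with n_train random samples per class."""
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train.extend(rng.permutation(idx)[:n_train])
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

def nearest_centroid_accuracy(X, y, train, test):
    """Placeholder classifier (not the paper's SVM): nearest class mean."""
    Xtr, ytr = X[train], y[train]
    classes = np.unique(ytr)
    centroids = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])
    dists = ((X[test][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y[test]).mean())

def evaluate(X, y, n_train=20, n_splits=5, seed=0):
    """Mean and standard deviation of accuracy over n_splits random splits."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_splits):
        train, test = split_per_class(y, n_train, rng)
        accs.append(nearest_centroid_accuracy(X, y, train, test))
    return float(np.mean(accs)), float(np.std(accs))

# Toy data: 3 well-separated classes with 44 samples each.
toy_rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 44)
X = toy_rng.normal(size=(len(y), 8)) + 3.0 * y[:, None]
mean_acc, std_acc = evaluate(X, y)
print(f"{100 * mean_acc:.1f} +/- {100 * std_acc:.1f}")
```

Fixing the splits in advance, as the text describes, lets every method be scored on exactly the same training and testing images, so differences in accuracy are not caused by different random partitions.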

From Table 5, we can observe that:


^The ETHZ Synthesizability data set contains 21302 textures of
300 x 300 pixels, downloaded with 60 keywords.

^Following [26], we use kernel for LBP and PRICoLBP. In the
experiments, LBP uses three scales and PRICoLBP uses 6 templates.
The dimensions of LBP and PRICoLBP are 54 and 3540, respectively.
We use a linear SVM for our LOAD and the CNN.


Table 5: Experimental results on the OULU-ETHZ set.

Methods                   Accuracy (%)
LBP (Gray) [23]           38.6 ± 1.2
PRICoLBP (Gray) [26]      50.5 ± 1.8
PRICoLBP (Color) [26]     52.8 ± 1.6
SIFT (IFV)                53.2 ± 1.9
CNN                       62.1 ± 1.5
LOAD (IFV)                55.9 ± 2.0
LOAD + CNN                67.7 ± 1.6


• The CNN achieves the best result among all
compared single-feature approaches, and our LOAD
ranks second. The LOAD outperforms the LBP,
PRICoLBP and SIFT.

• The LOAD is strongly complementary to the CNN.
Their combination improves on the CNN by about 6%
(from 62.1% to 67.7%).
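
The improvements quoted in this section are of two kinds: an absolute gain in accuracy, and a relative decrease in error rate. As a small worked sketch, the code below computes both from the mean accuracies in Table 5. The function names are hypothetical, and the relative error reduction is taken to be (error before - error after) / error before, which is an assumed reading of "decreases the error rate by about X%".

```python
def absolute_gain(acc_base, acc_new):
    """Gain in accuracy, in percentage points."""
    return acc_new - acc_base

def relative_error_reduction(acc_base, acc_new):
    """Relative decrease in error rate, in percent of the baseline error."""
    err_base = 100.0 - acc_base
    err_new = 100.0 - acc_new
    return 100.0 * (err_base - err_new) / err_base

# Mean accuracies (%) taken from Table 5.
table5 = {"CNN": 62.1, "LOAD (IFV)": 55.9, "LOAD + CNN": 67.7}

results = {}
for name in ("CNN", "LOAD (IFV)"):
    gain = absolute_gain(table5[name], table5["LOAD + CNN"])
    red = relative_error_reduction(table5[name], table5["LOAD + CNN"])
    results[name] = (gain, red)
    print(f"{name}: +{gain:.1f} points, error reduced by {red:.1f}%")
```

On these numbers, the gain over the CNN is 5.6 percentage points, which corresponds to a relative error reduction of roughly 15%; the two measures should not be compared directly.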

Discussion. It is interesting to investigate why the
LOAD performs better than the CNN on the FMD, but worse
than the CNN on the OULU-ETHZ. We believe there are two
main reasons:

• The OULU-ETHZ shows better consistency in
appearance (e.g. color). The CNN is built on color
images, whereas the LOAD is extracted from gray
images. We believe that the OULU-ETHZ may have a
stronger color prior than the FMD. This argument is
supported by the fact that color PRICoLBP performs
better than gray PRICoLBP on the OULU-ETHZ, but
only achieves performance similar to gray PRICoLBP
on the FMD. The consistency of appearance on the
OULU-ETHZ is important for the CNN.

• Most of the images in the OULU-ETHZ are close-up
images, which are strongly aligned in scale.
Meanwhile, due to the skews when collecting the
ETHZ data set, the images are also well aligned in
rotation. Scale and rotation are two difficult
issues for the CNN to handle.

6. Conclusion 

This paper proposed a novel Local Orientation Adaptive
Descriptor (LOAD) to capture regional texture
information for image classification. It not only has
the discriminative power to capture texture information,
but is also strongly robust to illumination variation
and image rotation. Its superior performance on texture
and real-world material classification tasks
demonstrates its effectiveness. It is also strongly
complementary to learning-based methods (e.g.
Convolutional Neural Networks). The LOAD combined with
the CNN significantly outperforms both of them. We
believe this complementarity arises because the IFV
representation with the LOAD feature and the CNN belong
to two different approaches: non-structured and
structured. The former is robust to image rotation and
translation, but does not capture structured information
well. In contrast, the latter is good at capturing
structured information because of its hierarchical
max-pooling strategy, but is not robust to heavy image
rotation and translation. Therefore, the two are
strongly complementary.


Cloud Fabric Flour Fur Glass Grass Leather Metal Paper Plastic Sand Water Wood

Figure 7: OULU-ETHZ real-world material data set. The OULU-ETHZ has rich image transformations.